Abstract

In this paper, FPGA-based hardware co-simulation of an area- and power-efficient
FIR filter for wireless communication systems is presented. The implementation is
based on distributed arithmetic (DA), which substitutes multiply-and-accumulate
operations with look-up table (LUT) accesses. A parallel distributed arithmetic
(PDA) look-up table approach is used to implement an FIR filter in VHDL, taking
optimal advantage of the look-up table structure of the FPGA. The proposed design
is hardware co-simulated using System Generator 10.1, synthesized with Xilinx ISE
10.1 software, and implemented on a Virtex-4 based xc4vlx25-10ff668 target device.
Results show that the proposed design operates at 17.5 MHz throughput and consumes
0.468 W of power, with a considerable reduction in the resources required to
implement the design compared to Coregen and add/shift based design styles. Due to
this reduction in required resources, the proposed design can also be implemented
on a Spartan-3 FPGA device to provide a cost-effective solution for DSP and
wireless communication applications.

Keywords: FPGA, PDA, Simulation, Add/Shift, VHDL. 

INTRODUCTION 


Today's consumer electronics, such as cellular phones and other multimedia
and wireless devices, often require digital signal processing
(DSP) algorithms for several crucial operations (Allred et al., 2004).
Due to a growing demand for such complex DSP applications, high-performance,
low-cost SoC implementations of DSP algorithms are receiving
increased attention among researchers and design engineers. There is a
constant requirement for efficient use of FPGA resources (Macpherson
and Stewart, 2006), since occupying less hardware for a given system
can yield significant cost-related benefits:

(i) Reduced power consumption; 

(ii) Area for additional application functionality; 

(iii) Potential to use a smaller, cheaper FPGA. 

Finite impulse response (FIR) digital filters are common DSP functions 
and are widely used in multiple applications like telecommunications, 
wireless/satellite communications, video and audio processing, biomedical 
signal processing and many others. On one hand, high development costs 
and time-to-market factors associated with ASICs can be prohibitive for
certain applications while, on the other hand, programmable DSP processors 
can be unable to meet desired performance due to their sequential-execution 
architecture (Longa and Miri, 2006). In this context, reconfigurable FPGAs
offer a very attractive solution that balances high flexibility, time-to-market,
cost and performance. Therefore, in this paper, an important DSP function,
the FIR filter, is implemented on a Virtex-4 FPGA. The output of
an FIR filter may be expressed as:

$$y = \sum_{k=1}^{K} C_k x_k \qquad (1.1)$$

where $C_1, C_2, \ldots, C_K$ are fixed coefficients and $x_1, x_2, \ldots, x_K$
are the input data words. A typical digital implementation will require K
multiply-and-accumulate (MAC) operations, which are expensive to
compute in hardware due to logic complexity, area usage, and throughput
(White, 1989). Alternatively, the MAC operations may be replaced by a
series of look-up-table (LUT) accesses and summations. Such an
implementation of the filter is known as distributed arithmetic (DA).

DISTRIBUTED ARITHMETIC 

Distributed arithmetic (DA) is an efficient method for computing
inner products when one of the input vectors is fixed. It uses look-up tables
and accumulators instead of multipliers for computing inner products and
has been widely used in many DSP applications such as DFT, DCT,
convolution, and digital filters (White, 1989). Direct DA inner-product
generation starts from equation 1.1, where $x_k$ is a 2's-complement
binary number scaled such that $|x_k| < 1$. We may express each $x_k$ as

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \qquad (2.1)$$

where the $b_{kn}$ are the bits, 0 or 1, and $b_{k0}$ is the sign bit. Combining
equations 1.1 and 2.1 to express $y$ in terms of the bits of $x_k$, we get

$$y = \sum_{k=1}^{K} C_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] \qquad (2.2)$$

Equation 2.2 is the conventional form of expressing the inner
product. Interchanging the order of the summations gives:

$$y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} C_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} C_k (-b_{k0}) \qquad (2.3)$$

Equation 2.3 shows a DA computation where the bracketed
term is given by

$$\sum_{k=1}^{K} C_k b_{kn} \qquad (2.4)$$

Each $b_{kn}$ can take the values 0 and 1, so equation 2.4 can have $2^K$ possible
values. Rather than computing these values online, we may pre-compute
them and store them in a ROM. The input data can be used to directly
address the memory, and the addressed word is added to an accumulator.
After N such cycles, the accumulator contains the result, y. As an example,
let us consider K = 4, $C_1 = 0.45$, $C_2 = -0.65$, $C_3 = 0.15$, and
$C_4 = 0.55$. The memory must contain all possible combinations
($2^4 = 16$ values) and their negatives in order to accommodate the term

$$\sum_{k=1}^{K} C_k (-b_{k0}) \qquad (2.5)$$

which occurs at the sign-bit time. As a consequence, a $2 \times 2^K$-word ROM is
needed. Figure 1 shows the simple structure that can be used to compute
these equations. The S signal is the sign-bit timing signal. The term $x_k$
may be written as

$$x_k = \frac{1}{2}\left[ x_k - (-x_k) \right] \qquad (2.6)$$

and in 2's-complement notation the negative of $x_k$ may be written as

$$-x_k = -\bar{b}_{k0} + \sum_{n=1}^{N-1} \bar{b}_{kn} 2^{-n} + 2^{-(N-1)} \qquad (2.7)$$


where the overscore indicates the complement of a bit. Substituting
equations 2.1 and 2.7 into equation 2.6, we get

$$x_k = \frac{1}{2}\left[ -(b_{k0} - \bar{b}_{k0}) + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn}) 2^{-n} - 2^{-(N-1)} \right] \qquad (2.8)$$

To simplify the notation, define

$$a_{kn} = b_{kn} - \bar{b}_{kn}, \quad n \neq 0 \qquad (2.9)$$

and

$$a_{k0} = \bar{b}_{k0} - b_{k0} \qquad (2.10)$$

where the possible values of the $a_{kn}$, including n = 0, are $\pm 1$. Then
equation 2.8 may be written as

$$x_k = \frac{1}{2}\left[ \sum_{n=0}^{N-1} a_{kn} 2^{-n} - 2^{-(N-1)} \right] \qquad (2.11)$$

Substituting the value of $x_k$ from equation 2.11 into equation 1.1, we
obtain

$$y = \frac{1}{2} \sum_{k=1}^{K} C_k \left[ \sum_{n=0}^{N-1} a_{kn} 2^{-n} - 2^{-(N-1)} \right] \qquad (2.12)$$

$$y = \sum_{n=0}^{N-1} Q(b_n) 2^{-n} + 2^{-(N-1)} Q(0) \qquad (2.13)$$

where

$$Q(b_n) = \sum_{k=1}^{K} \frac{C_k}{2} a_{kn} \quad \text{and} \quad Q(0) = -\sum_{k=1}^{K} \frac{C_k}{2}$$


It may be seen that $Q(b_n)$ has only $2^{(K-1)}$ possible amplitude values, with a
sign that is given by the instantaneous combination of bits. The computation
of y is obtained by using a $2^{(K-1)}$-word memory, a one-word initial condition
register for $Q(0)$, and a single parallel adder/subtractor with the necessary
control-logic gates.
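The offset-binary form of equations 2.9 to 2.13 can be checked numerically. The following Python sketch is an illustration only and is not the VHDL used in this work: the helper names (`to_bits`, `obc_inner_product`, `q_magnitudes`), the 8-bit word length and the sample values are assumptions, while the coefficients are the K = 4 example values given above.

```python
def to_bits(x, n_bits):
    """Return the two's-complement bits [b_0 (sign), b_1, ..., b_{N-1}] of x, |x| < 1."""
    scaled = round(x * 2 ** (n_bits - 1))   # integer with N-1 fractional bits
    scaled &= (1 << n_bits) - 1             # wrap negative values to two's complement
    return [(scaled >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]


def obc_inner_product(coeffs, samples, n_bits):
    """Evaluate y = sum_n Q(b_n) 2^-n + 2^-(N-1) Q(0) with a_kn = +/-1."""
    bits = [to_bits(x, n_bits) for x in samples]
    q0 = -sum(c / 2 for c in coeffs)                 # Q(0), the initial condition
    acc = 2.0 ** (-(n_bits - 1)) * q0
    for n in range(n_bits):
        q = 0.0
        for c, b in zip(coeffs, bits):
            a = 2 * b[n] - 1                         # b_kn - not(b_kn), n != 0
            if n == 0:
                a = -a                               # a_k0 = not(b_k0) - b_k0
            q += (c / 2) * a
        acc += q * 2.0 ** (-n)
    return acc


def q_magnitudes(coeffs):
    """Collect the distinct |Q| values over all 2^K sign patterns a_k = +/-1."""
    mags = set()
    for pattern in range(2 ** len(coeffs)):
        q = sum((c / 2) * (1 if (pattern >> k) & 1 else -1)
                for k, c in enumerate(coeffs))
        mags.add(round(abs(q), 9))
    return mags


coeffs = [0.45, -0.65, 0.15, 0.55]        # example coefficients from the text
samples = [0.5, -0.25, 0.125, -0.75]      # assumed inputs, exactly representable in 8 bits
direct = sum(c * x for c, x in zip(coeffs, samples))
y = obc_inner_product(coeffs, samples, 8)
print(y, direct, len(q_magnitudes(coeffs)))
```

For these coefficients the 16 sign patterns collapse to 8 distinct magnitudes, which is the $2^{(K-1)}$ table size stated above; the sign is recovered from the bit pattern itself.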

CIRCUIT DESCRIPTION 

The basic LUT-DA scheme on an FPGA consists of three main
components, as shown in Figure 1: input registers, a 4-input LUT
unit and a shifter/accumulator unit.

Figure 1: LUT based DA implementation of 4-tap filter

Input Registers: To reduce the consumption of logic elements, RAM
resources are used to implement the shift registers (Allred et al.,
2004).

LUT Unit: The 4-input and 3-input LUT units are implemented with a
look-up table that holds all possible sum combinations of the filter
coefficients (Figure 1).

Shifter and Accumulator Unit: It consists of an accumulator and a shifter.
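The interaction of the three components can be illustrated with a small behavioural model. This Python sketch is an assumption-laden illustration rather than the implemented circuit: the function names, the 8-bit word length and the input samples are hypothetical, the coefficients are the 4-tap example values from the previous section, and the accumulator weights each bit plane by $2^{-n}$ directly instead of shifting as the hardware would.

```python
def build_lut(coeffs):
    """LUT unit: precompute sum(C_k * b_k) for every K-bit address (equation 2.4)."""
    lut = []
    for addr in range(2 ** len(coeffs)):
        # Bit k of the address decides whether coefficient k contributes.
        lut.append(sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1))
    return lut


def to_bits(x, n_bits):
    """Two's-complement bits [b_0 (sign), b_1, ..., b_{N-1}] of x, |x| < 1."""
    scaled = round(x * 2 ** (n_bits - 1))   # integer with N-1 fractional bits
    scaled &= (1 << n_bits) - 1             # wrap negative values to two's complement
    return [(scaled >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]


def da_inner_product(coeffs, samples, n_bits):
    """Bit-serial DA: one LUT access per bit plane, then accumulate (equation 2.3)."""
    lut = build_lut(coeffs)
    bits = [to_bits(x, n_bits) for x in samples]        # input registers
    acc = 0.0
    for n in range(n_bits):
        # The n-th bit of every input word forms the LUT address.
        addr = sum(bits[k][n] << k for k in range(len(coeffs)))
        term = lut[addr]
        # The sign-bit plane (n = 0) is subtracted; the others are weighted by 2^-n.
        acc += -term if n == 0 else term * 2.0 ** (-n)
    return acc


coeffs = [0.45, -0.65, 0.15, 0.55]        # example coefficients from the text
samples = [0.5, -0.25, 0.125, -0.75]      # assumed inputs, exactly representable in 8 bits
y_da = da_inner_product(coeffs, samples, 8)
y_mac = sum(c * x for c, x in zip(coeffs, samples))     # direct MAC reference
print(y_da, y_mac)
```

The two printed values agree to within floating-point rounding, which is the property the LUT-based scheme relies on: K multiplications are replaced by N table look-ups and additions.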

PROPOSED WORK 

In a DA implementation, as the filter size K increases, the memory
requirements grow exponentially as $2^K$. This problem is solved in this
paper by breaking up the filter into smaller base DA filtering units that
require less memory and less area. If the K-tap filter is divided into m
units of k-tap base units ($K = m \times k$), then the total memory requirement
would be $m \times 2^k$ memory words. The total number of clock cycles required
for this implementation is $B + \lceil \log_2(m) \rceil$, where B is the number of
bits per input word; the additional second term is the number of clock cycles
required to implement an adder tree to calculate the sums of the units. Thus
the decrease in throughput of this implementation is marginal. For instance,
in the proposed design K = 41; instead of $2^{41}$ words in a full LUT
implementation, we have chosen 12 partitions, with k = 4 for m = 5 units and
k = 3 for m = 7 units, which require only
$5 \times 2^4 + 7 \times 2^3 = 136$ memory words.
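The memory and latency figures quoted for the partitioned filter can be reproduced with a few lines of arithmetic. In this sketch the function names are hypothetical and the 16-bit input word length is an assumption chosen only to give the cycle count a concrete value; the partition sizes are those stated above.

```python
import math


def pda_memory_words(partition_taps):
    """Total LUT words when each base unit with k taps stores 2^k entries."""
    return sum(2 ** k for k in partition_taps)


def pda_clock_cycles(word_bits, num_units):
    """Bit-serial cycles plus the adder-tree depth ceil(log2(m))."""
    return word_bits + math.ceil(math.log2(num_units))


partitions = [4] * 5 + [3] * 7            # five 4-tap units and seven 3-tap units
total_taps = sum(partitions)              # 41 taps in total
words = pda_memory_words(partitions)      # 5*16 + 7*8 = 136 words
full_lut_words = 2 ** total_taps          # single-LUT alternative
cycles = pda_clock_cycles(16, len(partitions))   # assumed 16-bit inputs

print(total_taps, words, full_lut_words, cycles)
```

With the assumed 16-bit inputs the adder tree adds only 4 cycles to the 16 bit-serial cycles, while the table storage drops from $2^{41}$ words to 136.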

In this proposed work a 41-tap low-pass filter has been designed. The
first step in the design flow is to develop optimized VHDL code using
the distributed arithmetic algorithm and to import it through the Black Box
block of System Generator to build the model of the proposed design. Figure 2
shows the developed model of the proposed design using various Simulink and
System Generator blocks.

Figure 2: Model for Hardware Co-simulation

The part of the model enclosed in the green boundary shows the software-based
simulation, whose output can be seen in Figure 3. The part of the model
enclosed in the orange boundary shows the hardware-based simulation, whose
output can be seen in Figure 4. The spectrum scope in the blue boundary shows
the comparison between software- and hardware-based simulation, whose output
is shown in Figure 5.



Figure 3: Software Based Simulation

Figure 4: Hardware Based Simulation

Figure 5: S/W & H/W Based Simulation Comparison

The green output waveform in Figure 5 indicates that the software-based
simulation matches the hardware-based simulation completely, without errors.

RESULTS

The proposed design is implemented on the Virtex-4 based xc4vlx25-10ff668
target FPGA.

Table 1 shows the comparison of the proposed PDA design with the
published add-shift and Coregen-based PDA designs (Mirzaei et al., 2006)
implemented on a Virtex-4 device. It can be seen from the table that the
throughput and performance of the proposed design are 17.50 MHz and
210 Msps respectively, which are close to those of the other compared designs.


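Table 1: Virtex-4 Based Comparison of PDA, Coregen & Add/Shift

| Design Style | Slices | LUTs | FFs  | Throughput (MHz) | Performance (Msps) |
|--------------|--------|------|------|------------------|--------------------|
| Add-Shift    | 2154   | 1719 | 4161 | 18.58            | 225                |
| Coregen PDA  | 2475   | 3642 | 4748 | 18.50            | 222                |
| Proposed PDA | 1840   | 346? | 2985 | 17.50            | 210                |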


Figure 6 shows the comparison of area utilization between add/shift, PDA
(Coregen) and the proposed PDA (PPDA) for 41-tap filter designs. It can be
observed that the PPDA uses considerably fewer resources on the
target device than the other compared designs. Due to this reduction
in required resources, the proposed design can be implemented on a Spartan-3
FPGA, as shown in Table 2.



Figure 6: Area Comparison of Add Shift and PDA with PPDA

Table 2: Spartan-3 Based Implementation

| Design Style | Slices (R/A)* | LUTs (R/A)* | FFs (R/A)* | Throughput (MHz) | Performance (Msps) |
|--------------|---------------|-------------|------------|------------------|--------------------|
| Proposed PDA | 1840/1920     | 346?/3840   | 2985/3840  | 8.77             | 105.22             |
| MAC Parallel | 2046/1920     | 3193/3840   | 1323/3840  | -                | -                  |
| Add-Shift    | 2154/1920     | 1719/3840   | 4161/3840  | -                | -                  |
| Coregen PDA  | 2475/1920     | 3642/3840   | 4748/3840  | -                | -                  |

(R/A)*: Resources required / Resources available on target FPGA
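The required/available figures in Table 2 can be compared programmatically to see which designs fit the Spartan-3 target. This is a minimal sketch with hypothetical names; it uses only the slice and flip-flop columns as read from Table 2, and the LUT column is left out.

```python
# Resources available on the Spartan-3 target, as listed in Table 2.
AVAILABLE = {"slices": 1920, "ffs": 3840}

# Required resources per design style (slice and flip-flop columns of Table 2).
REQUIRED = {
    "Proposed PDA": {"slices": 1840, "ffs": 2985},
    "MAC Parallel": {"slices": 2046, "ffs": 1323},
    "Add-Shift": {"slices": 2154, "ffs": 4161},
    "Coregen PDA": {"slices": 2475, "ffs": 4748},
}


def fits(required):
    """A design fits only if every listed resource is within the available count."""
    return all(required[name] <= AVAILABLE[name] for name in required)


fitting = [design for design, req in REQUIRED.items() if fits(req)]
print(fitting)
```

On these two columns only the proposed PDA stays within the device limits; every other design exceeds the 1920 available slices.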


FPGA Based 
Hardware 
Co-Simulation 



Table 3 shows that the proposed design consumes a total power of 0.468 W
at a junction temperature of 31.3 °C.

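Table 3: Power Consumption

| Name                  | Value     | Used | Total Available | Utilization (%) |
|-----------------------|-----------|------|-----------------|-----------------|
| Clocks                | 0.04479 W | 1    | -               | -               |
| Logic                 | 0.00000 W | 3477 | 21504           | 16.2            |
| Signals               | 0.00000 W | 5239 | -               | -               |
| IOs                   | 0.00000 W | 27   | 450             | 6.0             |
| DCMs                  | 0.00000 W | 0    | 8               | 0.0             |
| Total Quiescent Power | 0.42292 W |      |                 |                 |
| Total Dynamic Power   | 0.04479 W |      |                 |                 |
| Total Power           | 0.46772 W |      |                 |                 |
| Junction Temperature  | 31.3 °C   |      |                 |                 |
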

CONCLUSIONS 

In this paper, a parallel distributed arithmetic algorithm for a
high-performance reconfigurable FIR filter is presented to enhance area and
power efficiency. The proposed design takes optimal advantage of the look-up
table structure of the target FPGA. The throughput and performance of the
proposed design are 17.5 MHz and 210 Msps respectively, with a considerable
reduction in used resources. Due to this reduction in required
resources, the proposed design can be implemented on a Spartan-3 FPGA
device to provide a cost-effective solution for DSP and wireless
communication applications.